Towards Better Monolingual Japanese Retrievers with Multi-Vector Models

Clavié, Benjamin

arXiv:2312.16144 (cs)

[Submitted on 26 Dec 2023 (v1), last revised 23 Sep 2024 (this version, v2)]

Abstract: As language-specific training data tends to be sparsely available compared to English, document retrieval in many languages has largely relied on multilingual models. In Japanese, the best-performing deep-learning-based retrieval approaches rely on multilingual dense embedders, with Japanese-only models lagging far behind. However, multilingual models require considerably more compute and data to train and have higher computational and memory requirements, while often missing out on culturally relevant information. In this paper, we introduce JaColBERT, a family of multi-vector retrievers trained on two orders of magnitude less data than their multilingual counterparts while reaching competitive performance. Our strongest model largely outperforms all existing monolingual Japanese retrievers on all datasets, as well as the strongest existing multilingual models on all out-of-domain tasks, highlighting the need for specialised models able to handle linguistic specificities. These results are achieved using a model with only 110 million parameters, considerably smaller than all multilingual models, and using only a limited amount of Japanese-language data. We believe our results show great promise to support Japanese retrieval-enhanced application pipelines in a wide variety of domains.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2312.16144 [cs.CL]
	(or arXiv:2312.16144v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2312.16144

Submission history

From: Benjamin Clavié
[v1] Tue, 26 Dec 2023 18:07:05 UTC (313 KB)
[v2] Mon, 23 Sep 2024 02:51:31 UTC (522 KB)
